Comprehensible and Accurate Cluster Labels in Text Clustering

Authors

  • Jerzy Stefanowski
  • Dawid Weiss
Abstract

The purpose of text clustering in information retrieval is to discover groups of semantically related documents. Accurate and comprehensible cluster descriptions (labels) let the user comprehend the collection’s content faster and are essential for various document browsing interfaces. Creating descriptive, sensible cluster labels is difficult: typical text clustering algorithms focus on optimizing proximity between documents inside a cluster and rely on a keyword representation to describe the discovered clusters. In the approach called Description Comes First (DCF), cluster labels are as important as document groups: DCF promotes machine discovery of comprehensible candidate cluster labels, which are later used to discover related document groups. In this paper we describe an application of DCF to the k-Means algorithm, including results of experiments performed on the 20-newsgroups document collection. Experimental evaluation showed that DCF does not decrease the metrics used to assess the quality of document assignment and offers good cluster labels in return. The algorithm utilizes a search engine’s data structures directly to scale to large document collections.

Introduction

Organizing unstructured collections of textual content into semantically related groups, from now on referred to as text clustering or clustering, provides unique ways of digesting large amounts of information. In the context of information retrieval and text mining, a general definition of clustering is the following: given a large set of documents, automatically discover diverse subsets of documents that share a similar topic. In typical applications, input documents are first transformed into a mathematical model where each document is described by certain features. The most popular representation for text is the vector space model (VSM) [Salton, 1989].
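The vector space model can be illustrated with a small sketch: each document becomes a row of term weights, and the similarity of two documents is the cosine of the angle between their rows. The function names (`build_vsm`, `cosine`), the raw term-frequency weighting, the whitespace tokenizer, and the sample documents are assumptions made for illustration. They are not taken from the paper.

```python
import math
from collections import Counter


def build_vsm(docs):
    """Return (vocabulary, matrix); each matrix row is one document's term-frequency vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary term; the cell holds the term's weight in this document.
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


def cosine(u, v):
    """Cosine of the angle between two document vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents, chosen only to show the effect.
docs = [
    "text clustering groups related documents",
    "document clustering groups similar text",
    "search engines index web pages",
]
vocab, matrix = build_vsm(docs)
sim_related = cosine(matrix[0], matrix[1])    # documents sharing several terms
sim_unrelated = cosine(matrix[0], matrix[2])  # documents sharing no terms
print(sim_related > sim_unrelated)
```

Note that the rows keep only term counts. Word order and inflection are lost at this step, so a readable label cannot be rebuilt from the matrix alone.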
In the VSM, documents are expressed as rows in a matrix, where columns represent unique terms (features) and the intersection of a column and a row indicates the importance of a given word to the document. A model such as the VSM helps in calculating similarity between documents (the angle between document vectors) and thus facilitates the application of various known (or modified) numerical clustering algorithms. While this is sufficient for many applications, problems arise when one needs to construct some representation of the discovered groups of documents: a label, a symbolic description for each cluster, something that represents the information that makes documents inside a cluster similar to each other and conveys this information to the user. Cluster labeling problems are often present in modern text and Web mining applications with document browsing interfaces. The process of returning from the mathematical model of clusters to comprehensible, explanatory labels is difficult because the text representation used for clustering rarely preserves the inflection and syntax of the original text. Clustering algorithms presented in the literature usually fall back on the simplest form of cluster representation: a list of the cluster’s keywords (the most “central” terms in the cluster). Unfortunately, keywords are stripped of syntactic information and force the user to manually find the underlying concept, which is often confusing.

Conference RIAO2007, Pittsburgh PA, U.S.A. May 30-June 1, 2007 Copyright C.I.D. Paris, France

Motivation and Related Works

The user of a retrieval system judges the clustering algorithm by what they see in the output: the clusters’ descriptions, not the final model, which is usually incomprehensible to humans. Our experience with the text clustering framework Carrot2 (www.carrot2.org) led us to pose a slightly different research problem (aligned with clustering but not exactly the same).
We shifted the emphasis of a clustering method to providing comprehensible and accurate cluster labels in addition to the discovery of document groups. We call this problem descriptive clustering: the discovery of diverse groups of semantically related documents associated with meaningful, comprehensible and compact text labels. This definition obviously leaves a great deal of freedom for interpretation because terms such as meaningful or accurate are very vague. We narrowed the set of requirements of descriptive clustering to the following:

  • comprehensibility, understood as grammatical correctness (word order, inflection, agreement between words if applicable);
  • conciseness of labels: phrases selected for a cluster label should minimize its total length (without sacrificing its comprehensibility);
  • transparency of the relationship between cluster label and cluster content, best explained by the ability to answer questions such as “Why was this label selected for these documents?” and “Why is this document in a cluster labeled X?”.

Little research has been done to address the requirements above. In the STC algorithm, the authors employed frequently recurring phrases as both the document similarity feature and the final cluster description [Zamir and Etzioni, 1999]. A follow-up work [Ferragina and Gulli, 2004] showed how to avoid certain STC limitations and use non-contiguous phrases (so-called approximate sentences). A different idea of ‘label-driven’ clustering appeared in the clustering with committees algorithm [Pantel and Lin, 2002], where strongly associated terms related to unambiguous concepts were evaluated using semantic relationships from WordNet. We introduced the DCF approach in our previous work [Osiński and Weiss, 2005] and showed its feasibility using an algorithm called Lingo. Lingo used singular value decomposition of the term-document matrix to select good cluster labels among candidates extracted from the text (frequent phrases).
The algorithm was designed to cluster results from Web search engines (short snippets and fragmented descriptions of original documents) and proved to provide diverse, meaningful cluster labels. Lingo’s weak point is its limited scalability to full or even medium-sized documents. In this...
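The Description Comes First idea from the abstract (find readable candidate labels first, then use them to form document groups) can be shown with a deliberately simplified toy. It is not the authors' algorithm. The paper applies DCF to k-Means and relies on a search engine's data structures, and neither appears here. The function names, the use of word bigrams as candidate labels, the `min_docs` threshold, the phrase-containment rule, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter


def candidate_labels(docs, min_docs=2):
    """Hypothetical label discovery: word bigrams occurring in at least min_docs documents."""
    doc_freq = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # A set, so each bigram is counted once per document.
        bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
        doc_freq.update(bigrams)
    return sorted(phrase for phrase, n in doc_freq.items() if n >= min_docs)


def assign_documents(docs, labels):
    """Labels come first: each label collects the indices of documents containing the phrase."""
    clusters = {label: [] for label in labels}
    for index, doc in enumerate(docs):
        # Pad with spaces so phrases only match on whole-word boundaries.
        text = " " + " ".join(doc.lower().split()) + " "
        for label in labels:
            if " " + label + " " in text:
                clusters[label].append(index)
    return clusters


# Hypothetical sample documents.
docs = [
    "open source search engine for text",
    "a fast search engine written in java",
    "k-means text clustering of newsgroups",
    "evaluating text clustering quality",
]
labels = candidate_labels(docs)
clusters = assign_documents(docs, labels)
for label, members in clusters.items():
    print(label, members)
```

Because each label is a phrase taken verbatim from the text, it keeps its word order and inflection. The question "Why is this document in a cluster labeled X?" also has a direct answer in this toy: the document contains the phrase X.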


Similar resources

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...


Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis

Recognizing genes with distinctive expression levels can help in prevention, diagnosis and treatment of the diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering the gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...


Big Text Data Clustering using Class Labels and Semantic Feature Based on Hadoop of Cloud Computing

Clustering of class labels can be generated automatically, which is much lower quality than labels specified by human. If the class labels for clustering are provided, the clustering is more effective. In classic document clustering based on vector model, documents appear terms frequency without considering the semantic information of each document. The property of vector model may be incorrect...


Pitch Contour Model for Chi Using Cart and Statis

This paper describes an approach to generating prosody parameters for Mandarin Chinese text-to-speech system. The Chinese fundamental frequency contour is decomposed into two parts, a global intonation contour and a syllable level tone contour. The global intonation contour is converted to pitch target labels in corpus. It is predicted by first predicting pitch target labels using statistical m...


Hierarchical Document Clustering: A Review

As text documents are largely increasing in the internet, the process of grouping similar documents for versatile applications have put the eye of researchers in this area. However most clustering methods suffer from challenges in dealing with problems of high dimensionality, scalability, accuracy and meaningful cluster labels. This paper presents a review on all these well known methods of doc...



Publication date: 2007